Question Set: Multivariate Discrete Distributions

Question 1: Categorical distribution with Dirichlet prior

Let \(X = \{x_1, \dots, x_N\}\) be \(N\) independent observations following a Categorical distribution with \(K\) classes and parameter vector \(\mathbf{p} = (p_1, \dots, p_K)\), where \(\sum p_k = 1\). Let the prior on \(\mathbf{p}\) be a Dirichlet distribution with hyperparameters \(\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K)\). Let \(n_k\) denote the number of observations in class \(k\).

  1. Show that the posterior distribution of the parameter vector \(\mathbf{p}\) follows a Dirichlet distribution: \[ \mathbf{p} | X \sim \text{Dir}(\alpha_1 + n_1, \dots, \alpha_K + n_K) \]

The likelihood function for \(N\) independent categorical observations is the product of the probabilities of the observed classes. Let \(n_k\) be the count of observations in class \(k\) such that \(\sum_{k=1}^K n_k = N\). \[ L(\mathbf{p} | X) = \prod_{i=1}^N P(x_i | \mathbf{p}) = \prod_{k=1}^K p_k^{n_k} \]

The prior distribution for \(\mathbf{p}\) is Dirichlet with parameters \(\boldsymbol{\alpha}\) \[ P(\mathbf{p}) \propto \prod_{k=1}^K p_k^{\alpha_k - 1} \]

The posterior distribution is proportional to the product of the likelihood and the prior: \[ \begin{aligned} P(\mathbf{p} | X) &\propto L(\mathbf{p} | X) \times P(\mathbf{p}) \\ &\propto \left( \prod_{k=1}^K p_k^{n_k} \right) \left( \prod_{k=1}^K p_k^{\alpha_k - 1} \right) \\ &\propto \prod_{k=1}^K p_k^{(n_k + \alpha_k) - 1} \end{aligned} \]

This is the kernel of a Dirichlet distribution with updated parameters \(\alpha'_k = \alpha_k + n_k\). Therefore: \[ \mathbf{p} | X \sim \text{Dir}(\alpha_1 + n_1, \dots, \alpha_K + n_K) \]

  2. Show that the posterior predictive probability of a new observation \(\tilde{x}\) belonging to class \(k\) is:

\[ P(\tilde{x} = k | X) = \frac{\alpha_k + n_k}{\sum_{j=1}^K \alpha_j + N} \]

The posterior predictive probability for a new observation \(\tilde{x} = k\) is the expected value of the parameter \(p_k\) under the posterior distribution. \[ \begin{aligned} P(\tilde{x} = k | X) &= \int P(\tilde{x} = k | \mathbf{p}) P(\mathbf{p} | X) \, d\mathbf{p} \\ &= \int p_k P(\mathbf{p} | X) \, d\mathbf{p} \\ &= E[p_k | X] \end{aligned} \]

The expected value of the \(k\)-th component of a Dirichlet distribution with parameters \(\boldsymbol{\alpha}' = (\alpha'_1, \dots, \alpha'_K)\) is:

\[ E[p_k] = \frac{\alpha'_k}{\sum_{j=1}^K \alpha'_j} \]

Substituting our posterior parameters \(\alpha'_k = \alpha_k + n_k\): \[ P(\tilde{x} = k | X) = \frac{\alpha_k + n_k}{\sum_{j=1}^K (\alpha_j + n_j)} = \frac{\alpha_k + n_k}{\sum_{j=1}^K \alpha_j + N} \]
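The update rule and the predictive formula above can be sketched in a few lines of Python; the prior and counts below are hypothetical values chosen only for illustration:

```python
# Sketch of the Dirichlet-Categorical posterior update and posterior
# predictive. The prior alpha and the counts are hypothetical.

def posterior_params(alpha, counts):
    """Dirichlet posterior parameters: alpha_k' = alpha_k + n_k."""
    return [a + n for a, n in zip(alpha, counts)]

def posterior_predictive(alpha, counts):
    """P(x_tilde = k | X) = (alpha_k + n_k) / (sum_j alpha_j + N)."""
    post = posterior_params(alpha, counts)
    total = sum(post)  # sum_j alpha_j + N
    return [a / total for a in post]

alpha = [1.0, 1.0, 1.0]   # uniform prior over K = 3 classes
counts = [3, 5, 2]        # N = 10 observations
print(posterior_params(alpha, counts))      # [4.0, 6.0, 3.0]
print(posterior_predictive(alpha, counts))  # each (alpha_k + n_k) / 13
```

Note that the predictive probabilities sum to one by construction, since the numerators sum to the shared denominator.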

Question 2: Multinomial distribution with Dirichlet prior

Let \(\mathbf{X}\) be a vector of counts \((n_1, \dots, n_K)\) resulting from \(N\) independent trials, modeled by a Multinomial distribution with parameter vector \(\mathbf{p} = (p_1, \dots, p_K)\), where \(\sum p_k = 1\). Suppose we place a Dirichlet prior on \(\mathbf{p}\) with hyperparameters \(\boldsymbol{\alpha} = (\alpha_1, \dots, \alpha_K)\). Show that the posterior distribution of the parameter vector \(\mathbf{p}\) given the data \(\mathbf{X}\) is a Dirichlet distribution with parameters \((\alpha_1 + n_1, \dots, \alpha_K + n_K)\).

To find the posterior distribution \(P(\mathbf{p} | \mathbf{X})\), we use Bayes’ theorem: \[ P(\mathbf{p} | \mathbf{X}) \propto P(\mathbf{X} | \mathbf{p}) \cdot P(\mathbf{p}) \]

Where \[ P(\mathbf{X} | \mathbf{p}) = \frac{N!}{n_1! \dots n_K!} \prod_{k=1}^K p_k^{n_k} \propto \prod_{k=1}^K p_k^{n_k} \]

and the Dirichlet prior is \[ P(\mathbf{p}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K p_k^{\alpha_k - 1} \]

Substituting these into Bayes’ theorem gives \[ \begin{aligned} P(\mathbf{p} | \mathbf{X}) &\propto \left( \prod_{k=1}^K p_k^{n_k} \right) \cdot \left( \prod_{k=1}^K p_k^{\alpha_k - 1} \right) \\ &\propto \prod_{k=1}^K p_k^{n_k + \alpha_k - 1} \end{aligned} \]

The resulting expression, \(\prod_{k=1}^K p_k^{(n_k + \alpha_k) - 1}\), is the kernel (functional form) of a Dirichlet distribution with new parameters \(\alpha'_k = \alpha_k + n_k\). Thus, the posterior distribution is: \[ \mathbf{p} | \mathbf{X} \sim \text{Dirichlet}(\alpha_1 + n_1, \dots, \alpha_K + n_K) \] This demonstrates that the Dirichlet distribution is the conjugate prior for the Multinomial distribution.
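Conjugacy can also be checked numerically. Assuming SciPy is available, the product of the Dirichlet prior density and the Multinomial likelihood should be proportional, as a function of \(\mathbf{p}\), to the \(\text{Dirichlet}(\boldsymbol{\alpha} + \mathbf{n})\) density, so their ratio is the same at every point of the simplex (the parameters below are hypothetical):

```python
# Numeric sanity check of Multinomial-Dirichlet conjugacy: the ratio
# prior * likelihood / Dirichlet(alpha + n) density should be a constant
# (the normalizing constant), independent of the evaluation point p.
import numpy as np
from scipy.stats import dirichlet, multinomial

alpha = np.array([2.0, 3.0, 4.0])   # hypothetical prior
n = np.array([5, 1, 4])             # hypothetical counts, N = 10
N = n.sum()

ratios = []
for p in [np.array([0.2, 0.3, 0.5]), np.array([0.6, 0.1, 0.3])]:
    unnorm_post = dirichlet.pdf(p, alpha) * multinomial.pmf(n, N, p)
    ratios.append(unnorm_post / dirichlet.pdf(p, alpha + n))

print(np.allclose(ratios[0], ratios[1]))  # True: same constant at both points
```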

Question 3: Indicator of a categorical variable

Let \(X \sim \text{Cat}(\mathbf{p})\) with \(k\) outcomes and probabilities \(\mathbf{p} = (p_1, \dots, p_k)\).

Identify the distribution followed by the indicator variable \(\mathbb{I}(X=j)\) and specify its parameters.

Instruction: Use an intuitive guess (layman’s terms) to determine the answer before starting the formal mathematical proof.

We define a new random variable \(Y\) as the indicator for the event that \(X\) takes the specific value \(j\): \[ Y = \mathbb{I}(X = j) = \begin{cases} 1 & \text{if } X = j \\ 0 & \text{if } X \neq j \end{cases} \]

The probability mass function \(P(Y = y)\) is found by summing the probabilities of all outcomes in the sample space of \(X\) that map to the value \(y\).

\(P(Y=1)\): The only \(x\) such that \(Y(x)=1\) is \(x=j\). \[P(Y=1) = P(X=j) = p_j\]

\(P(Y=0)\): The set of \(x\) such that \(Y(x)=0\) is \(\{i \in \{1, \dots, k\} : i \neq j\}\). \[ P(Y=0) = \sum_{i \neq j} P(X=i) = \left( \sum_{i=1}^k p_i \right) - p_j = 1 - p_j \]

Conclusion: The distribution of \(Y\) is: \[P(Y=y) = p_j^y (1-p_j)^{1-y}, \quad y \in \{0, 1\}\] which is the PMF of \(\text{Bernoulli}(p_j)\). Thus, we have shown that \(Y \sim \text{Bernoulli}(p_j)\).
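The conclusion matches a quick simulation: the empirical mean of the indicator \(\mathbb{I}(X = j)\) should approach \(p_j\), the Bernoulli parameter (the probabilities below are hypothetical):

```python
# Simulation sketch for Question 3: draw from a categorical distribution
# and track the indicator of one class; its empirical frequency should
# approach p_j. The probability vector is hypothetical.
import random

random.seed(0)
p = [0.5, 0.3, 0.2]   # categorical probabilities, k = 3 outcomes
j = 1                  # class whose indicator we track (0-indexed)

draws = random.choices(range(3), weights=p, k=100_000)
y = [1 if x == j else 0 for x in draws]
print(sum(y) / len(y))   # close to p[j] = 0.3
```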

Question 4: Base calling (Simplified version)

We are analyzing a specific site in a DNA sequence which can be one of \(K=4\) distinct nucleotides (A, T, C, G). The probability of observing each nucleotide is given by the vector \(P = (p_A, p_T, p_C, p_G)\).

We treat \(N\) independent reads of this site as draws from a Categorical distribution. To model our uncertainty about the vector \(P\), we assign a Dirichlet prior with hyperparameters \(\alpha = (\alpha_A, \alpha_T, \alpha_C, \alpha_G)\).

Given:

  • Prior Belief:
    • Dirichlet prior: \(\boldsymbol{\alpha} = (2, 2, 2, 2)\).
  • Observed Data (\(X\)): We observe \(N=20\) sequences with the following counts:
    • A: \(n_A = 10\)
    • T: \(n_T = 5\)
    • C: \(n_C = 0\)
    • G: \(n_G = 5\) (Note: \(10+5+0+5=20\))

4.1

Write down the likelihood function for the observed data \(X\) given \(P\).

\[ p(X|P) = \frac{20!}{10!5!0!5!}p_A^{10} \cdot p_T^{5} \cdot p_C^{0} \cdot p_G^{5} \]

4.2

Based on the likelihood function (Maximum Likelihood Estimation), what are the estimates for \(\hat{p}_A\), \(\hat{p}_T\), \(\hat{p}_C\), and \(\hat{p}_G\)?

\[ \begin{aligned} \hat{p}_A = & \frac{10}{20} = 0.5 \\ \hat{p}_T = & \frac{5}{20} = 0.25 \\ \hat{p}_C = & \frac{0}{20} = 0 \\ \hat{p}_G = & \frac{5}{20} = 0.25 \end{aligned} \]

4.3

Derive the posterior distribution \(p(P|X, \boldsymbol{\alpha})\).

\[ p(P|X, \boldsymbol{\alpha}) \propto p_A^{10+2-1} \cdot p_T^{5+2-1} \cdot p_C^{0+2-1} \cdot p_G^{5+2-1} = p_A^{11} \cdot p_T^{6} \cdot p_C^{1} \cdot p_G^{6} \]

4.4

Calculate the posterior predictive probability of observing each nucleotide in a new sequence: \(p(\tilde{x}=A|X,\boldsymbol{\alpha})\), \(p(\tilde{x}=T|X,\boldsymbol{\alpha})\), \(p(\tilde{x}=C|X,\boldsymbol{\alpha})\), and \(p(\tilde{x}=G|X,\boldsymbol{\alpha})\).

\[ \begin{aligned} p(\tilde{x}=A|X,\boldsymbol{\alpha}) = & \frac{10+2}{20+8} = \frac{12}{28} \approx 0.4286 \\ p(\tilde{x}=T|X,\boldsymbol{\alpha}) = & \frac{5+2}{20+8} = \frac{7}{28} = 0.2500 \\ p(\tilde{x}=C|X,\boldsymbol{\alpha}) = & \frac{0+2}{20+8} = \frac{2}{28} \approx 0.0714 \\ p(\tilde{x}=G|X,\boldsymbol{\alpha}) = & \frac{5+2}{20+8} = \frac{7}{28} = 0.2500 \end{aligned} \]
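The arithmetic in 4.2 and 4.4 can be verified directly from the counts and hyperparameters given in the problem:

```python
# Check of the base-calling example: MLE (4.2) vs. posterior predictive
# (4.4) under the Dirichlet(2, 2, 2, 2) prior.
counts = {"A": 10, "T": 5, "C": 0, "G": 5}
alpha  = {"A": 2,  "T": 2, "C": 2, "G": 2}
N = sum(counts.values())       # 20
alpha0 = sum(alpha.values())   # 8

mle = {k: n / N for k, n in counts.items()}
post_pred = {k: (alpha[k] + counts[k]) / (alpha0 + N) for k in counts}
print(mle)        # {'A': 0.5, 'T': 0.25, 'C': 0.0, 'G': 0.25}
print(post_pred)  # C gets 2/28 instead of 0
```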

4.5

Compare the posterior predictive probabilities with the MLE estimates from 4.2. How does the prior belief affect the estimation of rare events (e.g., \(n_C = 0\))?

Prior belief adjusts the estimation of rare events. Without a prior, the MLE for \(p_C\) is 0, implying ‘C’ is impossible. The Dirichlet prior (acting as pseudocounts or Laplace smoothing) assigns a non-zero probability to ‘C’, acknowledging that a zero count may result from a small sample size. This prevents the “zero-frequency” problem.

4.6

After talking with a genomic researcher, you realize that this sequence is suspected to be a CpG island, where the frequency of ‘C’ and ‘G’ is higher. How would you adjust your prior belief to reflect this?

We can increase the hyperparameters for ‘C’ and ‘G’. For example, setting \(\boldsymbol{\alpha}' = (2, 2, 5, 5)\) gives more weight to ‘C’ and ‘G’ relative to ‘A’ and ‘T’, reflecting the biological expectation of a CpG island.

4.7

Consider an extreme case where the Dirichlet prior is \(\boldsymbol{\alpha}'' = (2, 2, 18, 18)\), strongly favoring ‘C’ and ‘G’. Calculate the new posterior predictive probabilities.

\[ \begin{aligned} p(\tilde{x}=A|X,\boldsymbol{\alpha}'') = & \frac{10+2}{20+40} = \frac{12}{60} = 0.2000 \\ p(\tilde{x}=T|X,\boldsymbol{\alpha}'') = & \frac{5+2}{20+40} = \frac{7}{60} \approx 0.1167 \\ p(\tilde{x}=C|X,\boldsymbol{\alpha}'') = & \frac{0+18}{20+40} = \frac{18}{60} = 0.3000 \\ p(\tilde{x}=G|X,\boldsymbol{\alpha}'') = & \frac{5+18}{20+40} = \frac{23}{60} \approx 0.3833 \end{aligned} \]
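The same computation under the strong prior \(\boldsymbol{\alpha}'' = (2, 2, 18, 18)\) reproduces the values in 4.7:

```python
# Posterior predictive under the strong CpG-favoring prior from 4.7.
counts = {"A": 10, "T": 5, "C": 0, "G": 5}
alpha  = {"A": 2,  "T": 2, "C": 18, "G": 18}
N = sum(counts.values())       # 20
alpha0 = sum(alpha.values())   # 40

post_pred = {k: (alpha[k] + counts[k]) / (alpha0 + N) for k in counts}
print(post_pred)  # G is now most probable (23/60) despite n_A = 10
```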

4.8

Compared with the case where \(\boldsymbol{\alpha} = (2, 2, 2, 2)\), how does this strong prior influence the final inference?

A strong prior (large \(\alpha\) values) can dominate the observed data. Despite ‘A’ being the most frequent observation (\(n_A=10\)), the heavy prior weight on ‘C’ and ‘G’ makes them the most probable outcomes in the posterior predictive distribution, effectively overriding the evidence from the small sample.